Adam (“adaptive moment estimation”) is an optimizer for gradient descent. It was first proposed in Kingma and Ba (2014) and is discussed extensively elsewhere, e.g. in Chaudhury (2024) and Ruder (2017).
Adam incorporates both momentum and adaptive per-parameter learning rates, and it is generally the default choice for most commercial deep learning tasks.
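In practice Adam is usually invoked through a framework rather than implemented by hand. The sketch below assumes PyTorch is available and uses `torch.optim.Adam` inside a toy training loop; the model, data, and hyperparameter values are placeholders, not recommendations.

```python
import torch
import torch.nn as nn

# A tiny placeholder model and batch, just to show where Adam plugs in.
model = nn.Linear(10, 1)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

# Adam with its customary hyperparameters: lr=1e-3, betas=(0.9, 0.999), eps=1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()                       # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()                             # compute gradients
    optimizer.step()                            # Adam update of all parameters
```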
Adam maintains two state vectors: exponentially decaying averages of the gradient and of the squared gradient (the first and second moment estimates),

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.$$

Here $g_t$ denotes the gradient of the loss at step $t$, and $\beta_1, \beta_2 \in [0, 1)$ are decay rates, typically 0.9 and 0.999.
Because the momentum terms $m_t$ and $v_t$ are initialized at zero, they are biased toward zero during the first steps, so Adam uses bias-corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
Finally, the new value of the parameters is computed as

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where $\alpha$ is the learning rate and $\epsilon$ is a small constant (typically $10^{-8}$) that guards against division by zero.
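To make the update concrete, here is a minimal sketch of a single Adam step written directly from the equations above, using NumPy; the function name `adam_step`, the toy objective, and the hyperparameter values are illustrative choices, not taken from any particular library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the equations above.

    theta : current parameter vector
    grad  : gradient of the loss at theta
    m, v  : running first and second moment estimates
    t     : 1-based step counter (used for bias correction)
    """
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # near the minimum at [0, 0]
```

Note that `m`, `v`, and the step counter are the optimizer's only state, which is why Adam costs little more per parameter than plain gradient descent with momentum.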